22 research outputs found

    An OpenCL software compilation framework targeting an SoC-FPGA VLIW chip multiprocessor

    Modern systems-on-chip augment their baseline CPU with coprocessors and accelerators to increase overall computational capability and power efficiency, and thus have evolved into heterogeneous multi-core systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This paper discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a highly configurable VLIW Chip Multiprocessor architecture known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on a number of hardware configurations of the LE1 CMP. The presented OpenCL framework fully automates the compilation flow and supports work-item coalescing, which maps better onto the ILP processor cores of the LE1 architecture. This paper discusses in detail both the software stack and target hardware architecture and evaluates the scalability of the proposed framework by running 12 industry-standard OpenCL benchmarks drawn from the AMD SDK and the Rodinia suites. The benchmarks are executed on 40 LE1 configurations, with 10 implemented on an SoC-FPGA and the remainder on a cycle-accurate simulator. Across the 12 OpenCL benchmarks, results demonstrate near-linear wall-clock performance improvement of 1.8x (using 2 dual-issue cores), up to 5.2x (using 8 dual-issue cores) and, in one case, super-linear improvement of 8.4x (FixOffset kernel, 8 dual-issue cores). The number of OpenCL benchmarks evaluated makes this study one of the most complete in the literature
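
As a rough illustration of the work-item coalescing mentioned in the abstract above, the sketch below models the general idea in Python: instead of invoking the kernel once per work-item, the kernel body is wrapped in a loop over the local IDs of a work-group, so one core runs the whole group as a single loop that a static-issue compiler can schedule. The vector-add kernel, the function names and the sizes are all hypothetical; this is not the LE1 toolchain or its generated code.

```python
# Toy model of work-item coalescing (illustrative only, not the LE1 framework).

def kernel_body(gid, a, b, out):
    """Hypothetical vector-add kernel body for a single work-item."""
    out[gid] = a[gid] + b[gid]


def run_per_work_item(global_size, a, b):
    """Baseline: one kernel invocation per work-item."""
    out = [0] * global_size
    for gid in range(global_size):
        kernel_body(gid, a, b, out)
    return out


def coalesced_group(group_id, local_size, a, b, out):
    """Coalesced form: one call executes every work-item of a work-group.

    The loop over local IDs replaces local_size separate invocations.
    """
    base = group_id * local_size
    for lid in range(local_size):
        gid = base + lid
        out[gid] = a[gid] + b[gid]


def run_coalesced(global_size, local_size, a, b):
    """Dispatch one coalesced call per work-group (assumes an even split)."""
    assert global_size % local_size == 0
    out = [0] * global_size
    for group_id in range(global_size // local_size):
        coalesced_group(group_id, local_size, a, b, out)
    return out


a = list(range(16))
b = [10 * x for x in range(16)]
per_item = run_per_work_item(16, a, b)
coalesced = run_coalesced(16, 4, a, b)
```

Both paths compute the same result; the coalesced version simply issues 4 calls instead of 16 for this assumed work-group size of 4.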

    An efficient multiple precision floating-point Multiply-Add Fused unit

    Multiply-Add Fused (MAF) units play a key role in the processor's performance for a variety of applications. The objective of this paper is to present a multi-functional, multiple precision floating-point Multiply-Add Fused (MAF) unit. The proposed MAF is reconfigurable and able to execute a quadruple precision MAF instruction, or two double precision instructions, or four single precision instructions in parallel. The MAF architecture features a dual-path organization reducing the latency of the floating-point add (FADD) instruction and utilizes the minimum number of operating components to keep the area low. The proposed MAF design was implemented on a 65 nm silicon process achieving a maximum operating frequency of 293.5 MHz at 381 mW power
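
To show why a fused multiply-add differs numerically from a separate multiply followed by an add, the sketch below emulates the fused operation in Python by computing a*b+c exactly with rational arithmetic and rounding once, then compares it with the ordinary two-rounding expression. It only illustrates the single-rounding semantics in double precision; it says nothing about the proposed unit's datapath, and the helper names and operand values are chosen for illustration.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Emulate a fused multiply-add: exact a*b+c, rounded to double once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


def unfused_multiply_add(a, b, c):
    """Separate multiply and add: the product is rounded before the add."""
    return a * b + c


# Operands chosen so that the exact product 1 - 2**-60 rounds to 1.0.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

fused = fused_multiply_add(a, b, c)
unfused = unfused_multiply_add(a, b, c)
```

Here the unfused result loses the small residual entirely, while the fused emulation preserves it because no intermediate rounding takes place.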

    VThreads: A novel VLIW chip multiprocessor with hardware-assisted PThreads

    We discuss VThreads, a novel VLIW CMP with hardware-assisted shared-memory Thread support. VThreads supports Instruction Level Parallelism via static multiple-issue and Thread Level Parallelism via hardware-assisted POSIX Threads along with extensive customization. It allows the instantiation of tightly-coupled streaming accelerators and supports up to 7-address Multiple-Input, Multiple-Output instruction extensions. VThreads is designed in technology-independent Register-Transfer-Level VHDL and prototyped on 40 nm and 28 nm Field-Programmable Gate Arrays. It was evaluated against a PThreads-based multiprocessor based on the Sparc-V8 ISA. On a 65 nm ASIC implementation, VThreads achieves up to a 7.2x performance increase on synthetic benchmarks, 5x on a parallel Mandelbrot implementation, 66% on a threaded JPEG implementation, 79% on an edge-detection benchmark and ~13% on DES, compared to the Leon3MP CMP. In the range of 2 to 8 cores, VThreads demonstrates a post-route (statistical) power reduction of between 65% and 57%, at an area increase of 1.2%-10% for 1-8 cores, compared to a similarly-configured Leon3MP CMP. This combination of micro-architectural features, scalability, extensibility, hardware support for low-latency PThreads, power efficiency and area makes the processor an attractive proposition for low-power, deeply-embedded applications requiring minimum OS support

    System on fabrics architecture using distributed computing

    This paper describes a novel, distributed sensor network with parallel processing capability based on the Instruction Systolic Array (ISA). A new computing paradigm is introduced where spatially distributed sensors are closely coupled to processing elements and the whole array forms a parallel computer. This may find applications in wearable devices for sensing the position and other metrics of a human body and rapidly processing that data. A new programming model to implement the distributed computer on fabrics is described. The fabric-based distributed computing concept has been validated using a number of parallel applications including a real-time shape sensing and reconstruction application. The exemplar wearable system based on the ISA concept has been realized using off-the-shelf microcontrollers and sensors. Results show that the application executes on the prototype ISA implementation in real time, thus confirming the viability of the proposed architecture for fabric-resident computing devices

    Parametric data-parallel architectures for TLM acceleration

    We discuss the architecture and microarchitecture of a scalable, parametric vector accelerator for the TLM algorithm. Architecture-level experimentation demonstrates an order of magnitude complexity reduction for vector lengths of 16 32-bit single-precision elements. We envisage the proposed architecture replicated in an SoC environment, thus forming a multiprocessor system capable of tapping parallelism at the thread level as well as the data level

    Shape reconstruction using instruction systolic array

    This paper describes a novel, 2D mesh architecture prototype based on the Instruction Systolic Array (ISA) paradigm for distributed computing on fabrics. We discuss a real-time shape sensing and reconstruction application executing on this architecture and demonstrate a physical design for a wearable system based on the ISA concept, constructed out of off-the-shelf microcontrollers and sensors. Results demonstrate that the application executes in 39 ms on our prototype ISA implementation, thus confirming the viability of the proposed architecture for fabric-resident computing devices

    Feasibility of imaging photoplethysmography

    Contact and spot measurement have limited the application of photoplethysmography (PPG); an imaging PPG system comprising a digital CMOS camera and light-emitting diodes (LEDs) at three wavelengths is therefore developed to detect blood perfusion in tissue. The imaging PPG system enables, ideally, contactless monitoring with a larger field of view and, by applying multi-wavelength illumination, monitoring at different tissue depths, helping to understand changes in blood perfusion. For each individual LED wavelength, the PPG signals can be derived in both transmission and reflection modes. The outcome shows that imaging PPG is able to detect blood perfusion in an illuminated tissue and indicates the vascular distribution and the blood cell response to each individual LED wavelength. The functionality investigation leads to an engineering model for 3-D visualization of blood perfusion in tissue and to the development of imaging PPG tomography

    Efficient protection of the pipeline core for safety-critical processor-based systems

    The increasing number of safety-critical commercial applications has generated a need for components with high levels of reliability. As CMOS process sizes continue to shrink, the reliability of ICs is negatively affected since they become more sensitive to transient faults. New circuit designs must take this fact into consideration, and incorporate adequate protection against the effects of transient faults. This paper presents a novel method for protecting the pipelined execution unit of an embedded processor. It is based on a self-configured architecture with hybrid redundancy that can mask single and multiple errors, which can occur on storage elements due to transient or permanent faults. This concept can be easily applied to any processing architecture of this nature with a high safety integrity level. Results from error-injection experiments are also reported that show that this design can maintain a non-interrupted and failure-free operation under single and double errors with a probability that exceeds 99.4%

    Study of the effects of SEU-induced faults on a pipeline protected microprocessor

    This paper presents a detailed analysis of the behavior of a novel fault-tolerant 32-bit embedded CPU as compared to a default (non-fault-tolerant) implementation of the same processor during a fault injection campaign of single and double faults. The fault-tolerant processor tested is characterized by per-cycle voting of microarchitectural and the flop-based architectural states, redundancy at the pipeline level, and a distributed voting scheme. Its fault-tolerant behavior is characterized for three different workloads from the automotive application domain. The study proposes statistical methods for both the single and dual fault injection campaigns and demonstrates the fault-tolerant capability of both processors in terms of fault latencies, the probability of fault manifestation, and the behavior of latent faults
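
The voting idea referred to in the abstract above can be illustrated with a generic bitwise majority voter over three redundant copies of a register, combined with injected bit flips. This is a textbook triple-modular-redundancy sketch written for illustration; it is not the per-cycle, distributed voting scheme of the processor studied, and the register value and fault positions are arbitrary assumptions.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority over three redundant copies of a word."""
    return (a & b) | (a & c) | (b & c)


def inject_bit_flip(word, bit):
    """Model a single-event upset by flipping one bit of a stored word."""
    return word ^ (1 << bit)


golden = 0xDEADBEEF  # arbitrary fault-free register value

# One corrupted copy: the two good copies outvote it.
single_fault = majority_vote(golden, inject_bit_flip(golden, 7), golden)

# Two corrupted copies, different bit positions: each bit still has a
# 2-of-3 majority of correct values, so the fault is masked.
double_fault_different_bits = majority_vote(
    inject_bit_flip(golden, 3), inject_bit_flip(golden, 20), golden
)

# Two corrupted copies, same bit position: the faulty value wins the vote.
double_fault_same_bit = majority_vote(
    inject_bit_flip(golden, 3), inject_bit_flip(golden, 3), golden
)
```

The last case shows why double-fault campaigns matter: a plain 2-of-3 voter masks any single upset but can be defeated when two copies fail in the same bit.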

    A system-on-chip vector multiprocessor for transmission line modelling acceleration

    We discuss a configurable, System-on-Chip vector multiprocessor for accelerating the Transmission Line Modeling (TLM) algorithm with an architecture capable of exploiting the two primary forms of parallelism in the code, thread-level and data-level parallelism. Theoretical results demonstrate an order of magnitude reduction in the dynamic instruction count for a scalar-processor/vector-coprocessor configuration at a vector length of sixteen 32-bit single-precision elements. Furthermore, a multi-vector SoC architecture consisting of ten such vector accelerators provides a near-linear theoretical performance benefit of the order of 88% in three out of four benchmark configurations, which is orthogonal to the benefit realized by vectorization alone. We discuss in detail this potent architecture and present implementation data for the 2-way multi-processor VLSI macrocell